knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(lubridate)
library(viridis)
library(leaflet)
lat_long = read.csv("./data/lat-long.csv")

\(~\)

Motivation

The motivation of this project was to use the data, the New York City (NYC) Department of Health and Mental Hygience (DOHMH) Dog Licensing Dataset.This dataset spans various boroughs of NYC and will help to deepen our understanding of the current patterns and trends among canines in NYC.

\(~\)

Initial Questions

We sought to understand the current landscape of licensed dogs in New York City. Additionally, we were also interested in comparing these demographic factors among the five boroughs. Furthermore, we wanted to deepen our understanding on the current patterns of licensing between 2017 - 2019 and examine if there were variations by year or month of license. Finally, we sought to find ways of predicting a dog’s age based on the borough and the gender of the dog. Based on this information, our questions are:

  • How can we better understand and visualize dog distributions by borough in NYC?
  • What are the trends of dog licenses by year and/or month in NYC?
  • How can we predict dog age based on borough and gender of the dog?

\(~\)

Data

We utilized the NYC Dog Licensing dataset. All dog owners who reside in NYC are required by law to license their dogs. This is a publicly available dataset sourced from the Department of Health and Mental Hygiene (DOHMH) Dog Licensing System where owners are able to apply for and renew dog licenses. This dataset was created on September 19, 2017 and was last updated on July 25, 2019. There are 10 variables in this dataset with a raw number of 345,727 observations. The variables provided in the raw dataset are as follows:

  • RowNumber: Row Number
  • AnimalName: User-provided dog name (unless specified otherwise)
  • AnimalGender: M(Male) or F(Female) dog gender
  • AnimalBirthMonth: Month and year dog was born
  • BreedName: Dog Breed
  • Borough: Owner Borough
  • ZipCode: Owner Zip Code
  • LiscenceIssuedDate: Date the dog license was issued
  • LiscenceExpiredDate: Date the dog license expires
  • Extract Year: Year Data was extracted

\(~\)

Exploratory Analysis

\(~\)

Loading and Cleaning Data

lat_lon = read.csv("./data/lat-long.csv")

dogz = read.csv("./data/NYC_Dog_Licensing_Dataset.csv") %>% 
  janitor::clean_names() %>% 
  filter(breed_name != "Unknown",
         extract_year-animal_birth_month < 20) %>% 
  mutate(animal_name = str_to_sentence(animal_name),
         borough = ifelse(zip_code %in% c(10001:10282),"Manhattan", borough),
         borough = ifelse(zip_code %in% c(10301:10314), "Staten Island", borough),
         borough = ifelse(zip_code %in% c(10451:10475), "Bronx", borough),
         borough = ifelse(zip_code %in% c(11004,11101:11106,11109,11351,11354:11375,11377:11379,11385,11411:11423,11426:11430,11432:11436,11691:11694,11697), "Queens", borough),
         borough = ifelse(zip_code %in% c(11201,11203:11226,11228:11239,11241:11243,11249,11252,11256), "Brooklyn", borough),
         borough = as.factor(borough),
         breed_name = as.character(breed_name),
         license_issued_date = as.Date(license_issued_date, format = "%m/%d/%Y"),
         license_expired_date = as.Date(license_expired_date, format = "%m/%d/%Y"),
         license_length = year(license_expired_date) - year(license_issued_date),
         dog_age = 2019 - animal_birth_month,
         breed_name = ifelse(breed_name == "American Pit Bull Terrier/Pit Bull", "American Pit Bull Mix / Pit Bull Mix", breed_name),
         breed_name = ifelse(breed_name == "American Pit Bull Terrier/Pit Bull", "American Pit Bull Mix / Pit Bull Mix", breed_name),
         breed_name = ifelse(breed_name == "Australian Cattle Dog", "Australian Cattledog", breed_name),
         breed_name = ifelse(breed_name == "Bassett Hound", "Basset Hound", breed_name),
         breed_name = ifelse(breed_name == "Brittany", "Brittany Spaniel", breed_name),
         breed_name = ifelse(breed_name == "Coonhound, Black and Tan", "Black and Tan Coonhound", breed_name),
         breed_name = ifelse(breed_name == "Coonhound, Black and Tan", "Black and Tan Coonhound", breed_name),
         breed_name = ifelse(breed_name == "Coonhound, Blue Tick", "Bluetick Coonhound", breed_name),
         breed_name = ifelse(breed_name == "Coonhound, Treeing Walker", "Treeing Walker Coonhound", breed_name),
         breed_name = ifelse(breed_name == "Cotton de Tulear", "Coton de Tulear", breed_name),
         breed_name = ifelse(breed_name == "Japanese Chin/Spaniel", "Japanese Chin", breed_name),
         breed_name = ifelse(breed_name == "Jindo Dog, Korea", "Jindo", breed_name),
         breed_name = ifelse(breed_name == "Mastiff, French (Dogue de Bordeaux)", "Dogue de Bordeaux", breed_name),
         breed_name = ifelse(breed_name == "Pharoh hound", "Pharaoh Hound", breed_name),
         breed_name = ifelse(breed_name == "Schipperke", "Schipperkee", breed_name),
         breed_name = ifelse(breed_name == "Schnauzer, Miniature", "Miniature Schnauzer", breed_name),
         breed_name = ifelse(breed_name == "Schnauzer, Miniature Crossbreed", "Miniature Schnauzer, Crossbreed", breed_name),
         breed_name = ifelse(breed_name == "Welsh Corgi, Cardigan", "Cardigan Welsh Corgi", breed_name),
         breed_name = ifelse(breed_name == "Welsh Corgi, Pembroke", "Pembroke Welsh Corgi", breed_name),
         breed_name = ifelse(breed_name == "Bull Dog, French", "French Bulldog", breed_name),
         breed_name = ifelse(breed_name == "Collie, Bearded", "Bearded Collie", breed_name),
         breed_name = ifelse(breed_name == "Bullmastiff", "Mastiff, Bull", breed_name),
         breed_name = ifelse(breed_name == "Neapolitan Mastiff", "Mastiff, Neapolitan", breed_name),
         breed_name = ifelse(breed_name == "Tibetan Mastiff", "Mastiff, Tibetan", breed_name),
         breed_name = ifelse(breed_name == "Pointer, German Shorthaired", "German Shorthaired Pointer", breed_name),
         breed_name = ifelse(breed_name == "Pointer, German Wirehaired", "German Wirehaired Pointer", breed_name),
         breed_name = ifelse(breed_name == "Schnauzer, Giant", "Giant Schnauzer", breed_name),
         breed_name = ifelse(breed_name == "Schnauzer, Standard", "Standard Schnauzer", breed_name),
         breed_name = ifelse(breed_name == "Chinese Shar-Pei", "Shar-Pei, Chinese", breed_name)) %>% 
  drop_na(borough) %>% 
  rename(animal_birth_year = animal_birth_month, license_id = row_number) %>% 
  select(-extract_year)
dogz_1 =
  dogz %>% 
  distinct(animal_name, animal_gender, animal_birth_year, breed_name, .keep_all = TRUE) %>% 
  select (-license_issued_date, -license_expired_date, -license_length)

distinct_dogz =
  left_join(dogz_1, lat_lon, by = "zip_code") %>% 
  select(-City,-State,-Timezone,-Daylight.savings.time.flag,-geopoint) %>% 
  janitor::clean_names()

The dataset was cleaned and adjusted for this analysis. All dogs whose breed were listed as “unknown” were removed, as well as any dogs who were listed as over 20 years old at the time of their license. Originally there were no boroughs listed in this dataset, so we used the available zip codes to assign boroughs to each record. Records whose zip code did not correlate to a borough were dropped. A variable license_length was added to calculate the length in years of each license. There were slight differences in the way breed names were recorded, these were adjusted to be uniform.

Variables were changed to appropriate data type where necessary.

The main database included the same dog multiple times if she had more than one license. We created a subset to catalog only the distinct dogs present in our dataset. To identify individual dogs we matched records on animal name, gender, birth year, and breed name. We also added latitude and longitude data that corresponds to zip code for mapping purposes.

\(~\)

Gender of dogs by borough

plot_2=
  distinct_dogz %>% 
  filter(animal_gender == "M" | animal_gender == "F") %>% 
  ggplot(aes(x = borough, fill = animal_gender)) +
  geom_histogram(position = "dodge", stat = "count") +
  labs(
    title = "Gender of dogs by borough",
    x = "New York Borough",
    y = "Number of dogs"
  )

plot_2

Based on this bar graph plotting borough by frequency of dogs, we can see that overall, Manhattan has the most dogs (over 20,000 dogs for each gender) followed by Brooklyn, Queens, The Bronx, and Staten Island (with about 5,000 dogs for each gender). Additionally, among all boroughs there are more male dogs than there are female dogs.

\(~\)

Age Distribution of Dogs by Boroughs

plot_3=
  distinct_dogz %>% 
  filter(dog_age < 20) %>% 
  ggplot(aes(x = borough, y = dog_age)) +
  geom_boxplot (color = "purple") +
  labs(
    title = "Age distribution of dogs by borough",
    x = "New York Borough",
    y = "Dog's Age"
  )
plot_3

Based on this box and whisker plot, plotting borough dog’s age in years, we can see that overall, the median age for dogs is about 6 years. The Bronx, Brooklyen, and Queens have similar distributions which are right skewed while Manhattan has a smaller range of values. Staten Island has the oldest dogs ranging above 10 years with no outliers shown.

\(~\)

Top Breed of Dogs by Age

table2 = distinct_dogz %>% 
  group_by(dog_age) %>% 
  count(breed_name) %>% 
  top_n(3)
knitr::kable(table2)
dog_age breed_name n
1 Labrador Retriever Crossbreed 231
1 Shih Tzu 400
1 Yorkshire Terrier 569
2 American Pit Bull Mix / Pit Bull Mix 448
2 Shih Tzu 564
2 Yorkshire Terrier 813
3 American Pit Bull Mix / Pit Bull Mix 697
3 Shih Tzu 579
3 Yorkshire Terrier 754
4 American Pit Bull Mix / Pit Bull Mix 896
4 Shih Tzu 472
4 Yorkshire Terrier 548
5 American Pit Bull Mix / Pit Bull Mix 956
5 Chihuahua 529
5 Shih Tzu 507
6 American Pit Bull Mix / Pit Bull Mix 778
6 Chihuahua 542
6 Yorkshire Terrier 481
7 American Pit Bull Mix / Pit Bull Mix 619
7 Chihuahua 445
7 Shih Tzu 471
8 American Pit Bull Mix / Pit Bull Mix 526
8 Shih Tzu 404
8 Yorkshire Terrier 452
9 American Pit Bull Mix / Pit Bull Mix 439
9 Shih Tzu 431
9 Yorkshire Terrier 440
10 Chihuahua 383
10 Shih Tzu 393
10 Yorkshire Terrier 406
11 Chihuahua 368
11 Shih Tzu 381
11 Yorkshire Terrier 390
12 Chihuahua 333
12 Shih Tzu 364
12 Yorkshire Terrier 391
13 Chihuahua 260
13 Shih Tzu 310
13 Yorkshire Terrier 290
14 Chihuahua 238
14 Shih Tzu 270
14 Yorkshire Terrier 254
15 Chihuahua 230
15 Shih Tzu 188
15 Yorkshire Terrier 207
16 Chihuahua 147
16 Shih Tzu 160
16 Yorkshire Terrier 159
17 Chihuahua 105
17 Shih Tzu 141
17 Yorkshire Terrier 106
18 Chihuahua 80
18 Shih Tzu 90
18 Yorkshire Terrier 71
19 Chihuahua 52
19 Shih Tzu 76
19 Yorkshire Terrier 51
20 Chihuahua 25
20 Jack Russell Terrier 23
20 Shih Tzu 23
21 Jack Russell Terrier 10
21 Maltese 9
21 Shih Tzu 12
22 Bichon Frise 7
22 Chihuahua 4
22 Maltese 4
22 Poodle, Toy 5

In this table, we see the top 3 dog breeds for each age, dog ages 1 - 22. We can see that there are more larger dog breeds appearing as the top breed in some of the younger age categories. After 9 years, all of the top 3 dog breeds for each age consist of small dog breeds. This is consistent with literature findings that smaller dog breeds tend to live longer lives than larger dog breeds. In addition, we find that several dog breeds outshine others in being the top dog breed for many of the age categories including Chihuanhua, Yorkshire Terrier, and American Pit Bull Mix / Pit Bull Mix.

\(~\)

Top Breed of Dogs by Borough

plot_4=
distinct_dogz %>% 
  group_by(borough) %>% 
  count(breed_name) %>% 
  top_n(5) %>% 
  ggplot(aes(x = breed_name, y = n)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  facet_grid(~borough, scales = "free", space = "free") +
  labs(
    title = "Top 5 breeds of dogs by borough",
    x = "Dog Breed",
    y = "Total Count"
  )
plot_4

Based on this plot, plotting dog breed by the total count, the most common dog breeds are American Pit Bull Mix/Pit Bull Mix, Chihuahua, Maltese, Shih Tzu, and Yorkshire Terrier. American Pit Bull Mix/Pit Bull Mix has the highest observed frequency in Brooklyn (n =2000) while the lowest in Staten Island (n = 600). Additionally, in Manhattan, Chihuahua’s, and Yorkshire Terriers have the highest observed frequency.

\(~\)

Number of licenses over time (license issue date vs. number of licenses)

plot_5.5=
  dogz %>% 
  mutate(license_issued_date = floor_date(as_date(license_issued_date), "month")) %>% 
  group_by(license_issued_date) %>% 
  summarize(n_obs = n()) %>% 
      ggplot(aes(x = license_issued_date, y = n_obs, color = n_obs)) + 
      geom_point() + geom_line() +
      labs(title = "Number of licenses over time", 
      x = "License issue date", 
      y = "Number of licenses")
plot_5.5

Based on this line graph, license issue date by the number of licenses, the highest frequency of licences were issued between 2016 - 2018 while the fewest number of licenses were issued from 2015-2016. This could be due to the fact that the license data was captured from 2017 - 2019, but there was still data with earlier license dates captured.

\(~\)

Number of licenses issued per month (liscence issue date vs. number of liscences)

plot_5.51=
  dogz %>% 
  mutate(license_issued_date = month(license_issued_date),
         license_issued_date = month.abb[license_issued_date],
         license_issued_date = factor(license_issued_date, c("Jan","Feb", "Mar", "Apr", "May", "Jun", "Jul","Aug", "Sep", "Oct","Nov", "Dec"), ordered = TRUE)) %>%
  group_by(license_issued_date) %>% 
  summarize(n_obs = n()) %>% 
      ggplot(aes(x = license_issued_date, y = n_obs, color = n_obs)) + 
      geom_point() + geom_line(group=1) +
      labs(title = "Number of licenses issued per month", 
      x = "License issue date", 
      y = "Number of licenses")
plot_5.51

Based on this line graph, there is an increasing trend present in regards to the number of licenses issued per month. From January - March there is a steep increase which increases more gradually between March through June. There is a spike from June to July and a plateau from July to August. From August, there is a steady decrease until November where we then see an increase in December. There could be a variety of factors to explain this phenomenon, but one might be that in the warmer months individuals are more likely to want to adopt/buy a dog where it is easier to upkeep a dog as opposed to the colder months across NYC.

\(~\)

Map

pal = colorNumeric(
  palette = "viridis",
  domain = distinct_dogz$zip_code)

distinct_dogz %>% 
  group_by(breed_name) %>% 
  mutate(click_label = str_c("<b> year(s)", dog_age, "</b><br>", animal_birth_year, "born <br>", animal_gender, " gender")) %>% 
  leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircleMarkers(~longitude, ~latitude, radius = .1, color = ~pal(zip_code), popup = ~click_label)

\(~\)

Additional Analysis

#create reference groups that are coded as 
distinct_dogz_reg = 
  distinct_dogz %>% 
  mutate(
    borough = fct_infreq(borough),
    animal_gender = fct_infreq(animal_gender),
    breed_name = fct_infreq(breed_name)
    )
#run linear regression with dog age as Y and borough, breed name and animal gender as categorical
fit = lm(dog_age ~ borough + animal_gender, data = distinct_dogz)

#view outputs
#fit %>% 
#  broom::tidy()

#create tidy table
fit %>% 
  broom::tidy() %>% 
  select(term, estimate, p.value) %>% 
  knitr::kable(digits = 3)
term estimate p.value
(Intercept) 16.541 0.000
boroughBrooklyn 0.051 0.265
boroughManhattan 0.520 0.000
boroughQueens 0.473 0.000
boroughStaten Island 1.247 0.000
animal_genderF -9.717 0.000
animal_genderM -9.892 0.000
#graph residuals for each borough
distinct_dogz_reg %>% 
  modelr::add_residuals(fit) %>% 
  ggplot(aes(x = borough, y = resid)) + geom_violin(fill = "pink") +
  labs(title = "Graph of Residuals for each Borough", 
    x = "Borough", 
    y = "Residuals")

In our linear regression model we predict our outcome of dog’s age given our main exposure of borough, adjusting for dog’s gender. Our beta values show that there is not much variance among dog age across different boroughs for both male and female dogs. Additionally, as seen in our residuals plot, our data does not likely contain outliers and appears to be normally distributed. One limitation to our model is the exclusion of breeds type as a variable in our model. There is a theory that small dogs tend to live longers lives, thus in an ideal model we would have controlled for dog breed by categorizing breed size as small, medium, large breed. Unfortunately, the large number of breed types included in our dataset impeded our ability to categorizing each breed, and without categorizing this variable our degrees of freedom would be far too high for our model to be parsimonious. For these reasons we chose to exclude this variable from our model.

\(~\)

Discussion